# Data Ingestion
# Dictionary
# Create dictionary
Creating a dictionary means declaring the variables that will be stored in the table together with the data type of each one, that is, the structure that the ingestion will follow. To access the section, click the three-line menu icon at the top left of the screen, then select ingests, dictionaries and finally create a dictionary. Every dictionary requires a name and a version. The steps to create a dictionary are:
- Name of the new dictionary: This name will be used later in the ingestion to identify the structure that the data will follow.
- Name of the version: This name changes as small modifications are made to the dictionary. When you choose a dictionary for an ingestion, you must also indicate which version to use.
- Description: A short description of what the variables in the dictionary represent, so that the subject of the ingestion is clear.
- Insert the fields: This is the most important step and consists of defining all the variables. For each one we must indicate:
  - Variable name: The name of the column generated in the ingestion.
  - Type of the variable: The part that the variable plays in the ingestion. The options are:
    - None: Variables that will not be used in later calculations and are not the identifier.
    - ts_index: The identifier of the ingestion. There can only be one per dictionary, and Knolar only permits this variable to be a date.
    - metric: Variables that will be used to perform calculations later on.
The data types that can be selected for each variable are:
- Double: floating-point numbers with twice the precision of Float numbers.
- Float: floating-point numbers, used to represent decimal values as well as values of a larger order of magnitude than integers.
- Long: refers to whole numbers without decimals that lie within the range
- String: stores character strings, that is, sequences of characters in Unicode format.
- Timestamp: stores dates.
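The rules above can be sketched as a small consistency check. This is only an illustration: the `Field` class, the `role` attribute (used here for the None / ts_index / metric choice), the lowercase type names and the example field names are assumptions made for the sketch, not Knolar's internal format.

```python
from dataclasses import dataclass

# Variable types and data types named in this section (lowercased for the sketch).
ROLES = {"none", "ts_index", "metric"}
DATA_TYPES = {"double", "float", "long", "string", "timestamp"}


@dataclass(frozen=True)
class Field:
    name: str       # column name generated in the ingestion
    role: str       # "none", "ts_index" or "metric"
    data_type: str  # one of DATA_TYPES


def validate_dictionary(fields):
    """Return a list of problems; an empty list means the definition is consistent."""
    problems = []
    for f in fields:
        if f.role not in ROLES:
            problems.append(f"{f.name}: unknown role {f.role!r}")
        if f.data_type not in DATA_TYPES:
            problems.append(f"{f.name}: unknown data type {f.data_type!r}")
    index_fields = [f for f in fields if f.role == "ts_index"]
    # There can only be one ts_index per dictionary.
    if len(index_fields) > 1:
        problems.append("more than one ts_index field")
    # The ts_index variable must be a date.
    for f in index_fields:
        if f.data_type != "timestamp":
            problems.append(f"{f.name}: ts_index must be a timestamp")
    return problems


# Hypothetical dictionary for sensor readings.
sensor_fields = [
    Field("event_time", "ts_index", "timestamp"),
    Field("temperature", "metric", "double"),
    Field("sensor_id", "none", "string"),
]
print(validate_dictionary(sensor_fields))
```

Running a check like this before filling in the form can catch a second ts_index or a non-date identifier early.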
# Dictionaries management
This option is used to view or modify existing dictionaries. To access the section, click the three-line menu icon at the top left of the screen, then select ingests, dictionaries and finally manage dictionary. Some of the changes that can be made are:
- View the dictionaries that have been created and their characteristics.
- Create a new version
- Add variables
- Delete variables
- Modify the type of existing variables: To do this, delete the variable and recreate it with the new type.
- Delete the dictionary
# Contextualization
The main objective of contextualization is to provide more information about the data to be ingested. It is used in real-time and batch ingestions, where it helps to select only the data that needs to be ingested. To apply this tool, the following three elements must be defined:
# Sites
When you access this section you will see two parts. The first is a list of all the sites that have been created; the second is where you create new sites by entering only the code of the site and a short description of it. Creating a new site only declares one of the subsections into which a dataset can be divided.
# Sources
When you access this section you will see two parts. The first is a list of all the sources that have been created; the second is where you create new sources by entering only the code of the source and a short description of it. Creating a new source only declares one of the subsections into which a given site can be divided, that is, the second level of subdivision of the whole dataset.
# Prefixes
When you access this section you will see two parts. The first is a list of all the prefixes that have been created; the second is where you create new prefixes by indicating only the codes of the site and the source to be joined and a short description. Creating a new prefix only joins the selected site and source, so you create one prefix for each combination of site and source that you need.
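The relationship between sites, sources and prefixes can be pictured with a short sketch. The codes, descriptions and the `make_prefix` helper are hypothetical; in Knolar these records are created through the screens described above.

```python
# Hypothetical site and source codes with their short descriptions.
sites = {"MAD": "Madrid plant", "BCN": "Barcelona plant"}
sources = {"PLC": "Controller readings", "ERP": "ERP exports"}


def make_prefix(site, source, description=""):
    """Join one existing site and one existing source into a prefix record."""
    if site not in sites:
        raise KeyError(f"unknown site: {site}")
    if source not in sources:
        raise KeyError(f"unknown source: {source}")
    return {"site": site, "source": source, "description": description}


# One prefix per combination of site and source that is actually needed.
prefixes = [
    make_prefix("MAD", "PLC", "Controller readings from Madrid"),
    make_prefix("BCN", "ERP", "ERP exports from Barcelona"),
]
print(prefixes)
```

The point of the sketch is that a prefix adds no new information of its own: it only pairs a site with a source that already exist.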
# Ingestions
This section covers loading values into the tool so that they can be analyzed and conclusions drawn from them. To access the section, click the three-line menu icon at the top left of the screen, then select ingests, ingests and finally create an ingest. There are three different ways to ingest data, and each type follows different steps.
# Real Time
Real-time ingestions are those that are continuously uploading data.
Configuration
- Name: The name used later to search for the ingestion when we want to view or modify it. This name is usually not very technical.
- Alias: Usually a more technical name. It will be the same as the name of the table generated in the ingestion.
- Version: The version of the ingestion that you are creating. The version changes as modifications are made.
- Description: A brief explanation of the data being ingested, so that reading it is enough to identify the ingestion.
- Roles: Select the roles of the employees who should have permission on this ingestion; otherwise only the creator will be able to see it. Roles can also be assigned later from the roles panel.
- Tags: Tags that help describe the ingestion.
Dictionary
- Select the dictionary: you must indicate the dictionary and the version that the ingestion data will follow. You can also see more details of the dictionary (the variables that compose it and the type of each one of them) to confirm that it is the correct dictionary.
Contextualization
The objective of this section is to filter the data to be loaded in the ingest. This can be done in the following ways:
- Set: Only the data that match the sites and sources indicated in this section are ingested.
- Filter: Only the data that match the expected prefix format, based on the selected dictionary field and its delimiter, are ingested.
- Skip: No contextualization is applied and all the data are ingested.
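As a rough illustration of the Filter and Skip options, the sketch below assumes that the selected dictionary field holds values of the form `site<delimiter>source<delimiter>rest` and that a prefix is a (site, source) pair. That value format, the function names and the sample events are assumptions made for this example, not Knolar's documented behaviour; the Set option is left out because the text does not describe how the match is made.

```python
def matches_prefix(value, delimiter, prefixes):
    """True when value looks like site<delimiter>source<delimiter>rest for a known prefix."""
    parts = value.split(delimiter)
    return len(parts) >= 3 and (parts[0], parts[1]) in prefixes


def contextualize(events, mode, field=None, delimiter=None, prefixes=frozenset()):
    """Return the events that would be ingested under the given mode."""
    if mode == "skip":
        # No contextualization: everything is ingested.
        return list(events)
    if mode == "filter":
        return [
            e for e in events
            if matches_prefix(str(e.get(field, "")), delimiter, prefixes)
        ]
    raise ValueError(f"unsupported mode: {mode}")


# Hypothetical events and prefixes.
events = [
    {"tag": "MAD.PLC.temp01"},
    {"tag": "XXX.PLC.temp01"},
    {"tag": "MAD"},
]
known_prefixes = {("MAD", "PLC")}
print(contextualize(events, "filter", field="tag", delimiter=".", prefixes=known_prefixes))
```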
Event
- Ingestion volume: Indicate the number of publications made per second.
- Example event: Upload an example file in JSON format, or write the event manually, also in JSON format.
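A minimal sketch of preparing an example event follows. The field names and values are hypothetical, and the requirement that the event be a single JSON object is an assumption of this sketch rather than a documented Knolar rule.

```python
import json

# Hypothetical example event for a sensor reading.
example_event = {
    "ts": "2024-01-15T10:30:00Z",
    "id": "MAD.PLC.temp01",
    "value": 21.5,
}


def load_example_event(text):
    """Parse the text of an example event and require a JSON object at the top level."""
    event = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(event, dict):
        raise ValueError("the example event must be a JSON object")
    return event


# Text that could be saved to a file and uploaded, or pasted into the editor.
text = json.dumps(example_event, indent=2)
print(text)
```

Parsing the text locally before uploading it is a cheap way to catch a missing quote or comma.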
Mapping
- The purpose of this section is to link each value of the example file with the corresponding variable of the previously selected dictionary. The section has three parts: the example file, the dictionary variables, and an overview of the results obtained after linking them.
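Conceptually, the mapping is a table from dictionary variables to keys of the example file. The sketch below shows that idea with a flat event; the names `apply_mapping`, the event keys and the dictionary variables are hypothetical.

```python
def apply_mapping(event, mapping):
    """Build a row keyed by dictionary variable.

    mapping: dictionary variable -> key in the example event.
    """
    missing = [src for src in mapping.values() if src not in event]
    if missing:
        raise KeyError(f"keys not found in the example event: {missing}")
    return {variable: event[src] for variable, src in mapping.items()}


# Hypothetical example event and mapping.
event = {"ts": "2024-01-15T10:30:00Z", "id": "temp01", "value": 21.5}
mapping = {"event_time": "ts", "sensor_id": "id", "temperature": "value"}

# Comparable to the overview shown after linking values and variables.
preview = apply_mapping(event, mapping)
print(preview)
```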
Other
- Enrichment: This process adds variables to the table to help us understand the data better. The variables that can be added must come from a previous metadata ingestion; that is, before a variable can be used for enrichment, it has to be ingested in the metadata format. The steps to perform the enrichment correctly are:
  - Ingest the metadata that contains these variables.
  - Choose the dictionary field that the ingestion and the metadata have in common.
  - Select the table with which you want to enrich the ingestion, that is, the table that was ingested as metadata.
- Transformations: This section is organized in three parts:
  - Dictionary: shows all the variables that are part of the dictionary selected for the ingestion.
  - Transformations: There are three types of transformations:
    - Transformations of primitive types: Used to modify the types of variables. When a file is loaded, all variables are interpreted as STRING by default; the most common use is therefore to change the variables that are not really STRING back to their original type.
    - Date transformations: Used to modify the format of a date. Enter the date format to which the value of the field should be converted and choose the time zone in which the date is located.
    - Advanced transformations: Based on creating a function that modifies the value of the variable.
  - Preview: shows an overview of the changes made to the variables.
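The three kinds of transformation can be illustrated with plain Python functions. This is a sketch under assumptions: the function names, the fixed UTC offset used in place of a named time zone, the explicit source format and the Celsius-to-Fahrenheit example are all made up for illustration and do not reflect how Knolar implements transformations.

```python
from datetime import datetime, timedelta, timezone


def cast_primitive(value, target):
    """Primitive type transformation: convert a STRING value to the target type."""
    casts = {"long": int, "float": float, "double": float, "string": str}
    return casts[target](value)


def transform_date(value, source_format, utc_offset_hours, target_format):
    """Date transformation: parse the value, attach its time zone, render the new format."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    parsed = datetime.strptime(value, source_format).replace(tzinfo=tz)
    return parsed.strftime(target_format)


def celsius_to_fahrenheit(value):
    """Advanced transformation: a user-defined function applied to the value."""
    return value * 9 / 5 + 32


print(cast_primitive("42", "long"))
print(transform_date("15/01/2024 10:30", "%d/%m/%Y %H:%M", 1, "%Y-%m-%dT%H:%M:%S%z"))
print(celsius_to_fahrenheit(20.0))
```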
Activate the ingestion
When you have completed all the above options, it is time to activate the ingestion. To do this, scroll down to the bottom of the page and activate the resources that the ingestion requires; these resources are directly related to the cost of the infrastructure. Then activate the ingestion; after a loading process, the ingestion will be successfully created.
# Batch
Batch ingestions are those that load data periodically. In Knolar, the loading is done every hour.
Configuration
- Name: The name used later to search for the ingestion when we want to view or modify it. This name is usually not very technical.
- Alias: Usually a more technical name. It will be the same as the name of the table generated in the ingestion.
- Version: The version of the ingestion that you are creating. The version changes as modifications are made.
- Description: A brief explanation of the data being ingested, so that reading it is enough to identify the ingestion.
- Email: The address that will receive a notification with information about the ingestion once it has been created.
- Roles: Select the roles of the employees who should have permission on this ingestion; otherwise only the creator will be able to see it. Roles can also be assigned later from the roles panel.
- Tags: Tags that help describe the ingestion.
Dictionary
- Select the dictionary: you must indicate the dictionary and the version that the ingestion data will follow. You can also see more details of the dictionary (the variables that compose it and the type of each one of them) to confirm that it is the correct dictionary.
Contextualization
The objective of this section is to filter the data to be loaded in the ingest. This can be done in the following ways:
- Set: Only the data that match the sites and sources indicated in this section are ingested.
- Filter: Only the data that match the expected prefix format, based on the selected dictionary field and its delimiter, are ingested.
- Skip: No contextualization is applied and all the data are ingested.
Event
- Select the type of file you will upload as an example in the ingestion: The file can be in CSV, JSON or JSONL format.
- Select whether or not the file has a header: If it does, take into account the restrictions that Knolar imposes: headers can only contain letters (A-Z or a-z), underscores (_) and hyphens (-); they can contain numbers but cannot consist only of numbers or start with a number; and they cannot contain spaces.
- Upload the file: The file must already be on your computer and must comply with the format and header conditions indicated above. When the file is uploaded, a warning indicates that all the data will be interpreted as Strings. If this is not correct, click on the 'continue' button and change the types in the transformations section.
- Indicate the separator: Each file may use a different separator between variables; indicate here which one it is.
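The header restrictions can be checked locally before uploading a file. The sketch below encodes them as a regular expression; the helper names and the sample header line are hypothetical, and the sketch assumes that a header may start with an underscore or a hyphen, since the restrictions only rule out a leading number.

```python
import re

# First character: letter, underscore or hyphen. Remaining characters may also be digits.
# Spaces and any other characters are rejected.
HEADER_RE = re.compile(r"[A-Za-z_-][A-Za-z0-9_-]*")


def is_valid_header(name):
    """True when the whole header name satisfies the restrictions."""
    return HEADER_RE.fullmatch(name) is not None


def invalid_headers(header_line, separator=","):
    """Return the header names in the line that break the restrictions."""
    return [h for h in header_line.split(separator) if not is_valid_header(h)]


# Hypothetical header line of a CSV file.
print(invalid_headers("event_time,sensor-id,temp2,2temp,room name,123"))
```

A header that cannot start with a number can never consist only of numbers, so a single pattern covers both conditions.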
Mapping
- The purpose of this section is to link each value of the example file with the corresponding variable of the previously selected dictionary. The section has three parts: the example file, the dictionary variables, and an overview of the results obtained after linking them.
Other
- Enrichment: This process adds variables to the table to help us understand the data better. The variables that can be added must come from a previous metadata ingestion; that is, before a variable can be used for enrichment, it has to be ingested in the metadata format. The steps to perform the enrichment correctly are:
  - Ingest the metadata that contains these variables.
  - Choose the dictionary field that the ingestion and the metadata have in common.
  - Select the table with which you want to enrich the ingestion, that is, the table that was ingested as metadata.
- Transformations: This section is organized in three parts:
  - Dictionary: shows all the variables that are part of the dictionary selected for the ingestion.
  - Transformations: There are three types of transformations:
    - Transformations of primitive types: Used to modify the types of variables. When a file is loaded, all variables are interpreted as STRING by default; the most common use is therefore to change the variables that are not really STRING back to their original type.
    - Date transformations: Used to modify the format of a date. Enter the date format to which the value of the field should be converted and choose the time zone in which the date is located.
    - Advanced transformations: Based on creating a function that modifies the value of the variable.
  - Preview: shows an overview of the changes made to the variables.
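Enrichment behaves conceptually like a join between the ingested rows and a metadata table on the field they have in common. The sketch below illustrates that idea only; the function name and sample data are hypothetical, and keeping rows that have no match in the metadata is an assumption of the sketch, not documented Knolar behaviour.

```python
def enrich(rows, metadata, key):
    """Add the metadata columns to each row whose key value exists in the metadata table."""
    index = {m[key]: m for m in metadata}
    enriched = []
    for row in rows:
        extra = index.get(row.get(key), {})
        # Values already present in the row take precedence over metadata values.
        enriched.append({**extra, **row})
    return enriched


# Hypothetical ingested rows and metadata table sharing the field "sensor_id".
rows = [
    {"sensor_id": "temp01", "temperature": 21.5},
    {"sensor_id": "temp99", "temperature": 19.0},
]
metadata = [{"sensor_id": "temp01", "room": "boiler", "unit": "C"}]

for row in enrich(rows, metadata, "sensor_id"):
    print(row)
```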
Activate the ingestion
When you have completed all the above options, it is time to activate the ingestion. To do this, scroll down to the bottom of the page and activate the resources that the ingestion requires; these resources are directly related to the cost of the infrastructure. Then activate the ingestion; after a loading process, the ingestion will be successfully created.
# Metadata
These ingestions are created with the main purpose of enriching the other two types of ingestions (batch and real-time). Enriching an ingestion in Knolar means adding information to it. Metadata ingestions are static: once created, they do not grow in volume, and the data does not have to be reloaded.
Configuration
- Name: The name used later to search for the ingestion when we want to view or modify it. This name is usually not very technical.
- Alias: Usually a more technical name. It will be the same as the name of the table generated in the ingestion.
- Description: A brief explanation of the data being ingested, so that reading it is enough to identify the ingestion.
- Email: The address that will receive a notification with information about the ingestion once it has been created.
- Roles: Select the roles of the employees who should have permission on this ingestion; otherwise only the creator will be able to see it. Roles can also be assigned later from the roles panel.
- Tags: Tags that help describe the ingestion.
Event
- Upload file: A sample file in CSV format must be uploaded. If it has headers, take into account the restrictions that Knolar imposes: headers can only contain letters (A-Z or a-z), underscores (_) and hyphens (-); they can contain numbers but cannot consist only of numbers or start with a number; and they cannot contain spaces.
- Indicate the separator: Each file may use a different separator between variables; indicate here which one it is.
- Index: Indicate the column that will be used to enrich the other ingestions, that is, the join column. Take into account the restrictions that Knolar imposes: the index must be the first column of the uploaded CSV file, the CSV must have a header row with the name of each column, and the column selected as index must coincide with the name of the first column of the file.
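The index restrictions can be verified on a local copy of the CSV before it is uploaded. The sketch below checks only the two conditions about the header row and the first column; the function name and the sample file content are hypothetical.

```python
import csv
import io


def check_metadata_csv(text, index_column, separator=","):
    """Return a list of problems with the header row and the index column."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    header = next(reader, None)
    if not header:
        return ["the file has no header row"]
    problems = []
    # The index must be the first column of the file.
    if header[0] != index_column:
        problems.append(
            f"index column {index_column!r} is not the first column ({header[0]!r})"
        )
    return problems


# Hypothetical metadata file using ';' as separator.
sample = "sensor_id;room;unit\ntemp01;boiler;C\n"
print(check_metadata_csv(sample, "sensor_id", separator=";"))
print(check_metadata_csv(sample, "room", separator=";"))
```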
Datatypes setting
In this section, you only have to confirm that the ingest process has been done correctly and adjust the type of each value. This adjustment is optional: Knolar estimates a type for each field, and only if the estimated type is not correct do we need to adjust it manually.
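To give an idea of what estimating a type from sample values can look like, here is a deliberately naive sketch. It is not Knolar's estimation logic, which is not described in this document; the function name, the rules and the type names it returns are assumptions for illustration.

```python
def estimate_type(values):
    """Guess a type name for a column from its string values (naive illustration)."""

    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    if all_parse(int):
        return "long"
    if all_parse(float):
        return "double"
    return "string"


print(estimate_type(["1", "2", "3"]))
print(estimate_type(["1", "2.5"]))
print(estimate_type(["boiler", "1"]))
```

A guess of this kind can be wrong, for example for numeric codes that should stay as strings, which is why a manual adjustment step is useful.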
Activate the ingestion
When you have completed all the above options, it is time to activate the ingestion. To do this, scroll down to the bottom of the page and activate the resources that the ingestion requires; these resources are directly related to the cost of the infrastructure. Then activate the ingestion; after a loading process, the ingestion will be successfully created.
# Ingestions management
This option is used to view or modify existing ingestions. To access the section, click the three-line menu icon at the top left of the screen, then select ingestions, ingestions and finally ingestions management. Some of the changes that can be made are:
- Activate or deactivate ingestion
- Delete the ingestion
- Create a new version: The changes allowed are related to the mapping, enrichment and transformations sections.
It is important to note that, once a first version of the ingestion has been created, a new version cannot modify the type of a field or reduce the number of fields to be ingested; in that case, a new ingestion must be created. Ingestion management is mainly used to modify the steps performed in the mapping or transformations sections, as long as the modification meets these criteria.
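The two restrictions on new versions can be expressed as a simple comparison between the fields of the existing version and those of the proposed one. The sketch below is illustrative: the function name, the representation of fields as a name-to-type mapping and the sample fields are assumptions, and it treats added fields as allowed only because the text does not forbid them.

```python
def check_new_version(old_fields, new_fields):
    """Compare two versions given as {variable name: data type}; return the violations."""
    problems = []
    for name, data_type in old_fields.items():
        if name not in new_fields:
            # Reducing the number of fields is not allowed.
            problems.append(f"{name}: field removed")
        elif new_fields[name] != data_type:
            # Modifying the type of a field is not allowed.
            problems.append(
                f"{name}: type changed from {data_type} to {new_fields[name]}"
            )
    return problems


# Hypothetical versions of the same ingestion.
v1 = {"event_time": "timestamp", "temperature": "double", "sensor_id": "string"}
v2 = {"event_time": "timestamp", "temperature": "long"}

print(check_new_version(v1, v2))
```

An empty result means the change can be made as a new version; any entry in the list means a new ingestion is needed instead.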